[WIP] Use unaligned read/writes for `core::mem::swap` on x86_64 #98892
Conversation
Hey! It looks like you've submitted a new PR for the library teams!
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 4c9609245af2160d5708ac0c07610e7fc8e562bf with merge 86189bb1747c007986340e760682fc32e092f136...
☀️ Try build successful - checks-actions
Queued 86189bb1747c007986340e760682fc32e092f136 with parent a3beeaa, future comparison URL.
Finished benchmarking commit (86189bb1747c007986340e760682fc32e092f136): comparison URL.

Results reported for: Instruction count, Max RSS (memory usage), Cycles.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never
This now has conflicts, and still says "WIP" in the name, but let me flip it to a pair of team eyes for feedback first.
This looks mostly good, but I'd like to see it rebased so we can run a current perf test.
library/core/src/mem/mod.rs
Outdated
}

// Trigger static assertion.
let _ = <T as ConstifyOffsets>::STATIC_TEST;
Rather than playing games with overflow, this can just be a `const { assert!(...); }` block.
Yes, this is better. Updated.
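For context, here is a minimal sketch of the suggested pattern; the helper name and the specific assert condition are illustrative assumptions, not the PR's actual check:

```rust
use core::mem::{align_of, size_of};

// Hypothetical helper: the inline `const` block is evaluated once per
// monomorphization, so a violated invariant becomes a compile error for
// that instantiation, with no overflow or associated-constant tricks.
fn assert_swappable_in_chunks<T>() {
    const { assert!(size_of::<T>() >= align_of::<T>()) };
}

fn main() {
    assert_swappable_in_chunks::<u64>(); // fine: size 8 >= align 8
    // assert_swappable_in_chunks::<()>(); // would fail: size 0 < align 1
}
```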
@rustbot author
Force-pushed from 4c96092 to 882407a
☀️ Try build successful - checks-actions
Finished benchmarking commit (cd4875ef17561bd9077809b8d65fcf4e5b439b9e): comparison URL.

Overall result: no relevant changes - no action needed.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: this benchmark run did not return any relevant results for this metric.
Max RSS (memory usage): a less reliable metric that may be of interest but was not used to determine the overall result above.
Cycles: a less reliable metric that may be of interest but was not used to determine the overall result above.
Binary size: a less reliable metric that may be of interest but was not used to determine the overall result above.

Bootstrap: 644.092s -> 643.294s (-0.12%)
This generates better ASM: https://godbolt.org/z/Mr4rWfoad

Misaligned accesses on modern x86_64 processors are fast (see the docs for `core::mem::swap_chunked`). The main difference is that swapping `#[repr(packed)]` or 1-byte-aligned types now uses bigger chunks by utilizing `movq` and `movl` instructions.

Also, bigger types (e.g. those at least the size of an XMM register) now use SIMD more effectively. The old code used SIMD inefficiently: it copied data to a register, stored it on the stack, then read it back. That caused unnecessary memory reads and writes and removed any benefit from SSE, because the instruction count ended up similar to a simple `usize`-chunked swap. The new code instead keeps the temporary SIMD chunks entirely in registers, employing either 4 XMM registers, 2 YMM registers, or 2 XMM registers depending on the type size and compiler flags.

Also, the size limit in the condition for choosing the chunked swap was lowered so that types like `std::vec::Vec<T>` (and especially `std::string::String`) use the new optimizations.
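For readers skimming the thread, here is a rough sketch of the unaligned-chunk idea; it is not the PR's implementation, and the fixed `u64` chunk size and byte-wise tail handling are simplifications (the real code also picks SIMD-sized chunks):

```rust
use core::mem::size_of;
use core::ptr;

// Swap two values through 8-byte chunks using unaligned loads/stores, so
// even `#[repr(packed)]` data can be moved in wide pieces. Unaligned
// accesses are cheap on modern x86_64.
pub fn swap_bytes_chunked<T>(x: &mut T, y: &mut T) {
    let len = size_of::<T>();
    let a = x as *mut T as *mut u8;
    let b = y as *mut T as *mut u8;
    let mut i = 0;
    unsafe {
        // Main loop: swap 8 bytes at a time without caring about alignment.
        while i + size_of::<u64>() <= len {
            let pa = a.add(i) as *mut u64;
            let pb = b.add(i) as *mut u64;
            let ta = ptr::read_unaligned(pa);
            let tb = ptr::read_unaligned(pb);
            ptr::write_unaligned(pa, tb);
            ptr::write_unaligned(pb, ta);
            i += size_of::<u64>();
        }
        // Tail: swap the remaining bytes one by one.
        while i < len {
            let ta = *a.add(i);
            *a.add(i) = *b.add(i);
            *b.add(i) = ta;
            i += 1;
        }
    }
}
```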
Force-pushed from 18c41fa to 4607601
@scottmcm Last perf run looks promising IMHO. What do you think about using
I started implementing the intrinsic: #111744
I think #97712 suggests that it's a good thing that MIRI runs on the block-optimization version, so we wouldn't want to hide it, now that MIRI supports it.
It's not clear to me -- is the swap intrinsic meant to fully replace this PR? Please
…iler-errors Use `load`+`store` instead of `memcpy` for small integer arrays

I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types, where doing it inline is going to be better than a `memcpy` call in assembly anyway. After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing, that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- and it also makes MIR inlining happier, since it doesn't need to deal with all the complicated library code if it just sees a couple of assignments.

LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s. In general, replacing `memcpy`s with typed instructions is hard in the middle-end -- even for `memcpy.inline`, where it knows it won't be a function call -- [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712). So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations.

rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing when MIR inlining was enabled (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now.

The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them. I don't want to try to bite off too much in this PR, though. (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.)
Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv>

Before:
```llvm
define void @swap_rgb48_old(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 {
  %a.i = alloca [3 x i16], align 2
  call void @llvm.lifetime.start.p0(i64 6, ptr nonnull %a.i)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false)
  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false)
  call void @llvm.lifetime.end.p0(i64 6, ptr nonnull %a.i)
  ret void
}
```

Note it going to the stack:
```nasm
swap_rgb48_old:                         # @swap_rgb48_old
        movzx   eax, word ptr [rdi + 4]
        mov     word ptr [rsp - 4], ax
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rsp - 8], eax
        movzx   eax, word ptr [rsi + 4]
        mov     word ptr [rdi + 4], ax
        mov     eax, dword ptr [rsi]
        mov     dword ptr [rdi], eax
        movzx   eax, word ptr [rsp - 4]
        mov     word ptr [rsi + 4], ax
        mov     eax, dword ptr [rsp - 8]
        mov     dword ptr [rsi], eax
        ret
```

Now:
```llvm
define void @swap_rgb48(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 {
start:
  %0 = load <3 x i16>, ptr %x, align 2
  %1 = load <3 x i16>, ptr %y, align 2
  store <3 x i16> %1, ptr %x, align 2
  store <3 x i16> %0, ptr %y, align 2
  ret void
}
```

This still lowers to `dword`+`word` operations, but has no stack traffic:
```nasm
swap_rgb48:                             # @swap_rgb48
        mov     eax, dword ptr [rdi]
        movzx   ecx, word ptr [rdi + 4]
        movzx   edx, word ptr [rsi + 4]
        mov     r8d, dword ptr [rsi]
        mov     dword ptr [rdi], r8d
        mov     word ptr [rdi + 4], dx
        mov     word ptr [rsi + 4], cx
        mov     dword ptr [rsi], eax
        ret
```

And as a demonstration that this isn't just `mem::swap`, a `mem::replace` on a small array (since replace doesn't use swap since rust-lang#83022), which used to be `memcpy`s in LLVM, changes in IR to
```llvm
define void @replace_short_array(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}
```

but that still lowers to reasonable `dword`+`qword` instructions:
```nasm
replace_short_array:                    # @replace_short_array
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        mov     edi, dword ptr [rsi + 8]
        mov     dword ptr [rax + 8], edi
        mov     qword ptr [rax], rcx
        mov     rcx, qword ptr [rdx]
        mov     edx, dword ptr [rdx + 8]
        mov     dword ptr [rsi + 8], edx
        mov     qword ptr [rsi], rcx
        ret
```
@cuviper Yes, I think it would fully replace this PR if it were merged, and it provides more noticeable benefits than this PR. I'll make this PR a draft until then.
@rustbot author
…<try> Tweak the threshold for chunked swapping

Thanks to @AngelicosPhosphoros for the tests here, which I copied from rust-lang#98892. This is an experiment as a simple alternative to that PR that just tweaks the existing threshold, since that PR showed that 3×Align (like `String`) currently doesn't work as well as it could.

…Mark-Simulacrum Tweak the threshold for chunked swapping

Thanks to @AngelicosPhosphoros for the tests here, which I copied from rust-lang#98892. This is an experiment as a simple alternative to that PR that just tweaks the existing threshold, since that PR showed that 3×Align (like `String`) currently doesn't work as well as it could.
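As a rough illustration of what "tweaking the threshold" means here; the predicate name and the exact cutoff are assumptions for the sketch, not the real condition in `core::mem`:

```rust
use std::mem::{align_of, size_of};

// Hypothetical predicate: use the chunked (usize/SIMD-sized) swap path only
// when the type is large enough for it to beat a plain move-based swap.
// Lowering this cutoff is what would let 3×usize types such as String qualify.
fn should_swap_in_chunks<T>() -> bool {
    size_of::<T>() >= 3 * align_of::<T>()
}

fn main() {
    assert!(should_swap_in_chunks::<String>()); // 24 bytes, align 8
    assert!(!should_swap_in_chunks::<u64>());   // 8 bytes, align 8
}
```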
@AngelicosPhosphoros any updates on this?
Closing this as inactive. Feel free to reopen this PR or create a new PR if you get the time to work on this. Thanks!